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DENOISING AUTOENCODERS FOR FAST COMBINATORIAL 
BLACK BOX OPTIMIZATION 

MALTE PROBST 


Abstract. Estimation of Distribution Algorithms (EDAs) require flexible 
probability models that can be efiiciently learned and sampled. Autoencoders 
(AE) are generative stochastic networks with these desired properties. We 
integrate a special type of AE, the Denoising Autoencoder (DAE), into an 
EDA and evaluate the performance of DAE-EDA on several combinatorial op¬ 
timization problems with a single objective. We asses the number of fitness 
evaluations as well as the required CPU times. We compare the results to 
the performance to the Bayesian Optimization Algorithm (BOA) and RBM- 
EDA, another EDA which is based on a generative neural network which has 
proven competitive with BOA. Eor the considered problem instances, DAE- 
EDA is considerably faster than BOA and RBM-EDA, sometimes by orders 
of magnitude. The number of fitness evaluations is higher than for BOA, but 
competitive with RBM-EDA. These results show that DAEs can be useful 
tools for problems with low but non-negligible fitness evaluation costs. 


1. Introduction 

Estimation of Distribution Algorithms (EDA, [2T1 [19]) are metaheuristics for 
combinatorial and continuous non-linear optimization. They maintain a popula¬ 
tion of solutions which they improve over consecutive generations. They estimate 
how likely it is that decisions are part of an optimal solution, and try to uncover the 
dependency structure between the decision variables. This information is obtained 
from the population by the estimation of a probabilistic model. If a model gener¬ 
alizes the population well, random samples drawn from the model have a structure 
and solution quality that is similar to the population itself. Repeated model estima¬ 
tion, sampling, and selection steps can solve difficult optimization problems. Simple 
models, such as factorizations of univariate frequencies, can be quickly estimated 
from a population, but they cannot represent interactions between decision vari¬ 
ables well. As a consequence, EDAs using univariate frequencies cannot efficiently 
solve complex problems. Using multivariate models allows complex problems to be 
solved, but fitting the model to a population and sampling new solutions can be 
very time-consuming. 

Recent work has shown that current models from machine learning such as the 
Restricted Boltzmann Machine (RBM), a stochastic neural network, can be used 
as probabilistic model for an EDA (31]. While not entirely matching the quality of 
the more statistics-driven Bayesian Optimization Algorithm (BOA, [28]), they have 
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other desirable properties: speed of training and sampling, and easy and efficient 
parallelization [CT[30] . 

We focus on another model from the field of machine learning, which is closely 
related to the RBM - the Autoencoder (AE, see e.g.[l4l|4]). Recent work has shown 
that AEs implicitly capture the probability distribution of given data, and that 
sampling this distribution is possible [6]. Although the AE is structurally similar 
to an RBM, the training procedure is simpler and computationally less expensive. 
Hence, they are even faster to train and sample. 

In this paper, we integrate a DAE in an EDA and assess its performance on mul¬ 
tiple standard benchmark problems from combinatorial optimization. We report 
both the number of fitness evaluations and the required CPU times. We include 
results for BOA, RBM-EDA, and a simple univariate method for comparison. 

Section [2] introduces EDAs, Autoencoders, shows how to use an AE within an 
EDA and briefly discusses a related approach. In Section [31 we present test prob¬ 
lems, reference algorithms, experimental setup, and results. We discuss the results 
in Section 0] and conclude the paper in Section |5l 

2. Preliminaries 

We review the basic concept of EDAs. We introduce Autoencoders, describe 
how to train and sample them, and show how an Autoencoder can be used in an 
EDA. 

2.1. Estimation of Distribution Algorithms. EDAs are well-established tools 
for solving combinatorial optimization problems (see e.g. pn [19]). The basic 
structure of EDAs is given by Algorithm [T] In a nutshell, they select promising 
individuals from a population, build a probabilistic model of this subpopulation and 
then use this model to sample new individuals. These new individuals are evaluated 
and usually form the new population. This loop continues until the population 
has converged. The underlying assumption is that a model, which has captured 
the essence of the old population, is able to sample new, unknown individuals 
that possess the same high-quality structure, thereby searching the solution space 
efficiently. 

EDAs differ in their choice of the model. Simple models use a vector with acti¬ 
vation probabilities for each variable of the problem, while neglecting dependencies 
between the variables, like UMDA or PBIL Hula]. Slightly more complex models 
use pairwise dependencies modeled as trees or forests [29] . More complex dependen¬ 
cies can be captured by models with multivariate interactions, like ECGA or BOA 
[niEH]. Multivariate models are better suited for complex optimization problems, 
as univariate models can cause an exponential growth of the required number of 
fitness evaluations for growing problem sizes (28] . Many algorithms use proba¬ 

bilistic graphical models with directed edges, i.e., Bayesian networks, or undirected 
edges, i.e., Markov random fields [18]. Hence, model building consists of finding a 
network structure that matches the problem structure and estimating the model’s 
parameters. Usually, the computational effort to build the model rises with model 
complexity and representational power. 

2.2. Autoencoders. This section shows how to train an AE, introduces the De- 
noising AE, and shows how to sample new solutions. 
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Algorithm 1 Pseudo code for main EDA loop 

1: Initialize Population P 
2: while not converged do 

3: Pparents ^ Select high-quality solutions from P based on their fitness 

4: M -ir- Build a model estimating the (joint) probability distribution of 

Pparents 

5: Pcandidates ^ Sample new Candidate solutions from M 

6- P Pparents D Pcandidates 

7: end while 



Figure 1. An Autoencoder as a graph. The visible neurons Xi 
{i G l..n) can hold a data vector of length n from the training data. 
In the EDA context, each Xi represents a decision variable. The 
hidden neurons hj (j G l..m) form a compressed representation 
of the input. The output neurons Zi [i G l..n) hold the AE’s 
reconstruction of the input. Weights W and W' fully connect x to 
h and h to z , respectively. 


2.2.1. Structure and Training Procedure. AEs are neural networks that have often 
been used for dimensionality reduction and are one of the building blocks for deep 
learning (see e.g. mum- They are, in essence, multi layer perceptrons, which 
is a very basic type of neural network (see e.g. [23]). 

An AE’s structure is defined by one visible layer x, at least one hidden layer h, 
and one output layer 2; (see Figure (Tj). The basic AE consists of two deterministic 
functions: the encoding function h = c(x;0) maps a given input, x G [0,1]”, to a 
hidden layer, h G [0,1]™, with parameters 6 and n,m G 'H. The decoding func¬ 
tion z = f{h-,9'), maps h back to a reconstruction z G [0,1]" in the input space. 
The training objective of the AE is to find parameters 9,6' which minimize the 
reconstruction error Err{x, z), i.e., the difference between x and z for all examples 
x',i G (1,...,t) in the training set: 


( 1 ) 


9,9' -.= argmin— 
e.e' T 


5^Err(x^z'). 

Z=1 


Common choices for Err(x, z) are the mean squared error function Err(x, z) = 
IJx — z\\^ or the cross entropy function Err(x, z) = — * log(2:fe) + (1 — Xk) * 

log(l - Zk)]. 
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Algorithm 2 Pseudo code for training an AE 

1: Initialize 9 = {W, bh, &z} randomly 

2 : Set 0 < a < 1, e.g. a = 0.1 

3: while not converged do 

4: for each example i in the training set do 

5: h = c{x^\9) 

6 : Z = f{h-e) 

7: 0 := 0 _ a * 

8 : end for 

9: end while 

10: (for training a DAE, replace x* with q{x'^\x^) in line 5) 


Encoding and decoding functions are usually chosen as c{x) = sigm(a; * W + bh) 
and f{h) = sigm(h * W + b^), where sigm(a:) = is the logistic function, W 

and W are weight matrices of size (n x m) and {mxn), respectively, and bh G K™, 
bz G R” are biases which work as offsets. Often, W and W are tied, i.e., W = . 

Then, the AEs configurable parameters are 9 = {W,bh,bz}. 

Minimizing (HD is performed by using a gradient descent algorithm (see Algorithm 
[2D. First, the parameters 9 are initialized to small, random values. Then, we repeat 
the following process for multiple epochs, i.e., passes through the training set: For 
each example x® in the training set we calculate the hidden layer, h = c(x®; 9), and 
the corresponding reconstruction z = f{h;9). We then change the parameters in 
the direction of the gradient, setting 


( 2 ) 


9 \= 9 — a * 


dErr(x, z) 


with learning rate 0 < a < 1. We stop the loop if the reconstruction error is 
small enough or another termination criterion has been met. Often, the parameter 
optimization is carried out using stochastic gradient descent, i.e., we use the average 
gradient from a mini-batch of b training examples to update 9. This usually speeds 
up learning and makes the gradient more stable [7]. 


2.2.2. Denoising AE. If the representational power of the hidden layer h is large 
enough (i.e., if m is not too small), a trivial way to solve (1) is to learn the identity 
function where each Xi is directly mapped to the corresponding Zi [2] . To force the 
model to learn a more useful representation, it is therefore often helpful to introduce 
a form of regularization [HE]. One example of a regularized AE is the Denoising 
Autoencoder (DAE) introduced in |35] . Here, each training example x is corrupted 
by a stochastic mapping x = q{x\x), i.e., we add random noise. Subsequently, 
the DAE calculates the reconstruction of the corrupted input, using encoding and 
decoding function, as z = f{c{x)). As with the original AE, the parameters are 
updated in the direction of Hence, the DAE tries to reconstruct x rather 

than X. The noise introduced by the corruption process q{-) also makes the model 
more robust to partially destroyed inputs [35]. 


2.2.3. Sampling a DAE. Classic AEs do not include a sampling process to generate 
new examples. However, recent work has shown that some variants of AEs, includ¬ 
ing the DAE, implicitly capture the structure of the data-generating density, and 
multiple sampling processes have been suggested and empirically validated (for an 
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Algorithm 3 Pseudo code for sampling a DAE 

1: Given the trained DAE’s 9 = {W,&c,&/} and its reconstruction function 
f(c{x)), the corruption process q{3:\x) 

2: Initialize x G [0,1]" randomly 

3: for a fixed number s of sampling steps do 

4: X = q{x\x) 

5: Z = /(c(x)) 

6: x := z 

7: end for 

8: Use cc as a sample from the DAE 


overview, see i)- Here, we adopt the sampling process proposed in [6], because it 
is the most general approach, and comes with a theoretical justification. 

Given a data-generating distribution, P{x), a corruption process, q{x\x) and 
a DAE that has been trained to reconstruct x from x, the sampling process is 
as follows (see Algorithm [3|) : First, we randomly initialize a sample x G [0,1]". 
Then, for s sampling steps, we corrupt the sample using the corruption process 
X = q{x\x) and use the trained DAE to reconstruct the input 2 = /(c(x)). For the 
next sampling step, we set x := z. After s sampling steps, we use x as a sample 
from the DAE. 

In [^, it was shown that this process converges to samples from the DAE’s 
approximation of the data-generating distribution, i.e., the training data. 

2.3. Using a Denoising Autoencoder in an EDA. We can use a DAE as 

probabilistic model for an EDA. In each generation of the EDA, we train a DAE 
to model the probability distribution of the solutions which survived the selection 
process. We then sample the DAE. Each sample is a vector x G [0,1]". To turn 
this vector of real-valued elements into a candidate solution, i.e., a binary string, 
we sample each variable x, from a Bernoulli distribution with p = x^. Then, we 
evaluate the fitness of the candidate solutions, and let the selection function decide 
which individuals will reach the next generation. 

Another approach for using a DAE in an EDA-like optimization process was 
recently suggested by [8] . Contrary to our approach, the DAE in [8] is not used as 
a multivariate EDA model to sample new solutions. Instead, it is trained on the 
best 10-20% of the population only. Subsequently, it is used to improve a second set 
of selected individuals from the population. Those individuals are first corrupted by 
the DAE’s corruption process, and then reconstructed by the DAE, using encoding 
and decoding function. 


3. Experiments 

We present test problems, reference algorithms, experimental setup, and results. 

3.1. Test Problems. We evaluate DAE-EDA on concatenated deceptive traps, 
NK landscapes and the HIFF function. All three are standard benchmark problems. 
Their difficulty depends on the problem size, i.e., problems with more decision 
variables are more difficult. Furthermore, the difficulty of concatenated deceptive 
trap functions and NK landscapes is tunable by a parameter. All three problems 
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are composed of subproblems, which are either deceptive (traps), overlapping (NK 
landscapes), or hierarchical (HIFF), and therefore multimodal. 

Concatenated deceptive traps are tunably hard, yet decomposable test problems 
[^. Here, a solution vector x is divided into I subsets of size k, with each one 
being a deceptive trap. Within a trap, all bits are dependent on each other but 
independent of all other bits in x. Thus, the Htness contribution of the traps can 
be evaluated separately and the total fitness of the solution vector is the sum of 
these terms. In particular, the assignment a = Xi-i+k-i (i-e., the k bits from Xi to 


2;i+fc-i)Q leads to a fitness contribution Fi as 


if Ei = fc) 


k 


Pi {a) 


k — + 1) otherwise. 


In other words, the fitness of a single trap increases with the number of zeros, except 
for the optimum of all ones. 

NK landscapes are defined by two parameters n and k and n fitness components 
fi,i G {I ... ,n} [16]. A solution vector x consists of n bits. The bits are assigned 
to n overlapping subsets, each of size fc +1. The fitness of a solution is the sum of n 
fitness components. Each component fi depends on the value of the corresponding 
variable Xi as well as k other variables. Each fi maps each possible configurations 
of its /c + 1 variables to a fitness value. The overall fitness function is 


n 



2=1 


Each decision variable usually influences several fi. These dependencies between 
subsets make NK landscapes non-separable. The problem difficulty increases with 
k. fc = 0 is a special case where all decision variables are independent and the 
problem reduces to a unimodal onemax. We use instances of NK landscapes with 
known optima from [26] . 

The Hierarchical If-and-only-if (HIEE) function [36] is defined for solutions vec¬ 
tors of length n = 2* where I €N is the number of layers of the hierarchy. It uses a 
mapping function M and a contribution function C, both of which take two inputs. 
The mapping function takes each of the n/2 blocks of two neighboring variables 
of level 1 = 1, and maps them onto a single symbol each. An assignment of 00 is 
mapped to 0, 11 is mapped to 1 and everything else is mapped to the null symbol 
The concatenation of M’s outputs on level I is used as M’s input for the next level 
Z -I-1 of the hierarchy, i.e., if level I = 1 has n variables, level I = 2 has n/2 variables. 
On each level, C assigns a htness to each block of two variables. The assignments 
00 and 11 are both mapped to 2^, everything else to 0. The total htness is the sum 
of all blocks’ contributions on all levels. In other words, a block contributes to the 
htness on the current level if both variables in a block have the same assignment. 
However, only if neighboring blocks agree on the assignment, they will contribute 
to the htness on the next level. HIEE therefore has two global optima, the string 
of all ones, and the string of all zeros. 


^The k variables assigned to trap I do not have to be adjacent, but can be at any position in 
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3.2. Reference Algorithms. We compare DAE-EDA to BOA [28], RBM-EDA, 
an EDA based on Restricted Boltzmann Machines m, and Population-Based In¬ 
cremental Learning (PBIL, [5]). 

3.2.1. Bayesian Optimization Algorithm. The Bayesian Optimization Algorithm is 
one of the state-of-the-art EDAs for discrete optimization problems. It was been 
proposed by and has been heavily used and researched since then [IZlIlSlil]- 

BOA uses a Bayesian network for modeling dependencies between variables. De¬ 
cision variables correspond to nodes and dependencies between variables correspond 
to directed edges. As the number of possible network topologies grows exponen¬ 
tially with the number of nodes, BOA uses a greedy construction heuristic to find 
a network structure G to model the training data. Starting from an unconnected 
(empty) network, BOA evaluates all possible additional edges, adds the one that 
maximally increases the fit between the model and selected individuals, and repeats 
this process until no more edges can be added. The fit between selected individuals 
and the model is measured by the Bayesian Information Criterion (BIG) [32] . BIC 
is based on the conditional entropy of nodes given their parent nodes and correction 
terms penalizing complex models. It can be calculated independently for all nodes. 
If an edge is added to the Bayesian network, the change of the BIC can be com¬ 
puted quickly. BOAs greedy network construction algorithm adds the edge with 
the largest BIC gain until no more edges can be added. Edge additions resulting 
in cycles are not considered. 

After the network structure has been learned, BOA calculates the conditional ac¬ 
tivation probability tables for each node. Once the model structure and conditional 
activation probabilities are available, BOA can produce new candidate solutions by 
drawing random values for all nodes in topological order. 

3.2.2. RBM-EDA. RBM-EDA uses a Restricted Boltzmann Machine as multivari¬ 
ate model. Restricted Boltzmann Machines are stochastic neural networks con¬ 
sisting of two layers of neurons, where the connections between the layers form a 
bipartite graph |33|. The input or visible layer x G [0,1]" of an RBM holds the 
input data represented by n binary variables. The second, hidden layer h G [0, Ij™ 
consists of m neurons. There is no dedicated output layer in an RBM. A weight 
matrix W holds weights Wij G K between all neurons Xi and hj. From a structural 
point of view, an RBM resembles an Autoencoder with one hidden layer where the 
output layer has been ’’folded” back onto the input layer. 

An RBM can be used as a model within an EDA, because it can be trained to 
model a probability distribution and it is possible to draw samples from this model 
[SKIIISI. Training the RBM means adjusting Wij s.t. the RBM models the 
probability distribution of the training data. This can be done by using the gradi¬ 
ent descent algorithm contrastive divergence |12] . Sampling new individuals from 
the model’s probability distribution can be performed using Gibbs sampling m- 

ED have shown that RBM-EDA is competitive to BOA. For difficult problems, 
it has a moderately higher, but still non-exponential complexity in the number 
of fitness evaluations. However, the time for solving problems grows slower with 
larger problem sizes. We compare DAE-EDA to RBM-EDA, because they are 
closely related in terms of the models’ structure and training process. 

3.2.3. PBIL. PBIL is one of the simplest EDAs. It assumes conditional indepen¬ 
dence of all n problem variables. PBIL stores a vector P = {pi,p 2 ,... ,Pn) of 
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activation probabilities. PBIL creates new individuals by sampling each variable 
from a Bernoulli distribution with p = pn- In each EDA generation t, PBIL selects 
the best p individuals yi, 1 / 2 , ■ ■ ■, 2 /fc from the population, and updates each as 



with 0 < a < 1 determining the strength of the update. We include PBIL in the 
experiment, because its results give an intuitive measure on the difficulty of the 
test problems. 

3.3. Experimental Setup. 

3.3.1. EDA Parametrization. We use several instances of the test problems (see 
section 1 ^ . For each instance and algorithm, we test multiple population sizes. 
For DAE-EDA, RBM-EDA, and BOA, we choose popsize £ {50; 100;...; 16,000}, 
for PBIL, we choose popsize £ {50; 100;...; 512,000}). We run 20 instances for 
each population size. 

In each run, the EDA is allowed to run for 100 generations (2000 for PBIL). We 
terminate the EDAs if there is no improvement in the best solution for more than 
20 generations (400 for PBIL). We report the average number of fitness evaluations 
and CPU time for the best solutions of all runs. All EDAs use tournament selection 
without replacement of size two | 20 j . 

For PBIL, we choose /i = 1 and a = .02., i.e., we use only the best individual 
in each generation to update the model. For the RBM, we use the same parameter 
settings as in m- 

The algorithms were implemented in Matlab/Octave and executed using Octave 
V3.2.4 on a on a single core of an AMD Opteron 6272 processor with 2,100 MHz. 

3.3.2. DAE Parametrization. We use the following parameters for the DAE: The 
number m of hidden neurons is equal to the problem size n. The corruption process 
q{x\x) randomly corrupts 10 % of the inputs by setting them to 0 or 1 (salt+pepper 
noise). When sampling new candidate solutions from the DAE, we perform s = 10 
sampling steps. During training, the learning rate a is 0.2, the batch size for 
stochastic gradient descent is 6 = 100. We use a cross-entropy error measure for 

(Ill- 

Like EH, we apply a simple parameter control scheme determining when to 
terminate DAE training. The scheme is based on the reconstruction error e = 
Err{x,z). e usually decreases with the number of epochs. Every second epoch 
t £ 1,... ,r, we calculate for a fixed subset u of the training set U the relative 
difference e“ = i/\u\^j^.^Err{x^, z^). We measure the decrease 7 of the recon¬ 
struction error in the last 33% of all epochs as 7 = {eQ,Qn — e“)/(eo — e“). 7 is 
then used to automatically check for convergence of the training. We stop training 
if 7 < 0.05. The rationale behind this is that the DAE has learned the relevant 
dependencies between the variables, and further training is unlikely to improve the 
model considerably. Furthermore, we stop the training if the DAE is overfitting, 
i.e., learning noise instead of problem structure. Therefore, we split the original 
training set into a training set U containing 90% of all samples and a validation set 
U' containing the remaining 10%. We train the DAE only for the solutions in U 
and, after each epoch, calculate the reconstruction error and for the training 
and validation set U and U', respectively. We stop the training phase as soon as 
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(|e“ — |)/e“ > 0.1 (i.e., the difference between the reconstruction errors is larger 

than 10%). 

3.4. Results. We report the performance of DAE-EDA, RBM-EDA, BOA, and 
PBIL for concatenated deceptive traps with A: = 4, n G {20,40,60} and fc = 5,n G 
{25,50,75}, NK landscapes with fc = 4,n G {30,34} and fc = 5,n G {30,34} (two 
instances i each) as well as the HIFF function with n G {64,128} (see table [1]). For 
each instance and algorithm, we select the minimal population size which leads to 
the optimal solution in at least 50% of the runs (left two result columns of table [J) 
and at least 90% of the runs (right two columns). We report the average number 
of fitness evaluations and CPU time of those runs. 

First, we analyze the number of fitness evaluations required. For all problems, 
and for both the runs with at least 50% and 90% success rate, respectively, PBIL 
uses the highest number of fitness evaluations. 

As expected, BOA has the best performance in terms of fitness evaluations. This 
is consistent with the previous findings comparing RBM-EDA and BOA [31]. 

Both DAE-EDA and RBM-EDA consistently use more fitness evaluations than 
BOA, except for the NK landscapes with n = 34, fc = 5. However, most of the time 
the number of fitness evaluations is on the same order of magnitude, and clearly 
better than that of the univariate PBIL. For the both the runs with at least 50% 
and 90% success rate, DAE-EDA and RBM-EDA are about tied for the number of 
instances with the least fitness evaluations. 

For the 128 bit HIFF problem DAE-EDA needed only 45% of the fitness eval¬ 
uations of the DAE inspired optimizer in |5|. We attribute this mainly to the 
sampling process, which samples from the trained model’s distribution directly, in¬ 
stead of using the DAE as a tool for an advanced local search modifying selected 
individuals. 

Second, we look at the average time the algorithms required to solve the respec¬ 
tive problem. If PBIL is able to solve the problem to optimality, it is usually the 
fastest algorithm, because fitness evaluations are computationally inexpensive for 
all benchmark problems. Note that for the 60-bit concatenated 4-trap problem, 
PBIL is able to find the optimal solution in at least 50% of the runs, nevertheless, 
the DAE-EDA is faster as it needs only about I%o of PBIL’s fitness evaluations. 

For all but one instance, DAE-EDA is significantly faster than both RBM-EDA 
and BOA, sometimes by multiple orders of magnitude (see Section Sj). This is also 
true for the instances where RBM-EDA needs a lower number of fitness evaluations. 
This is due to the much quicker model building of the DAE. 

4. Discussion 

The results suggest that DAE-EDA is able to decompose the test problems prop¬ 
erly, and solve the parts independently. This becomes evident when looking at the 
more complicated problems where the univariate PBIL struggles or fails. The qual¬ 
ity of DAE-EDAs underlying probabilistic model is similar to the one of RBM-EDA, 
but not as good as BOA’s. 

An interesting aspect of DAE-EDA is its speedy model building and sampling 
process. The CPU time to solve the test problems is much lower than that of the 
other multivariate methods, sometimes by multiple orders of magnitude. Note that 
the direct comparison of CPU times is not entirely fair for BOA. In a more effi¬ 
cient programming language instead of a script-based language like Matlab/Octave, 
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Problem 

Algorithm 

Pop 
in >50% of 
Evaluations 

Average 
Illation size such t] 

runs 

Time (sec) 

results 

lat optimum is found 

in >90% o 
Evaluations 

■ runs 

Time (sec) 

4-Traps 

20 bit 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

2,550±1,150 

20,300±7,753 

1,850*±382 

23,050±8,968 

18±6.9 

121±30 

117±26 

0.0*±0.3 

4,450±1,359 

20,300±7,753 

1,850*±382 

71,200±31,688 

21±9.2 

121±30 

117±26 

0.0*±0.4 

4-Traps 

40 bit 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

37,400±14,030 

45,000±14,346 

7,875*±1,035 

3,398,400±1,276,058 

90±28 

696±144 

1.963±304 

38*±14 

37,400±14,030 

56,000±9,633 

7,875*±1,035 

6,121,600±2,199,336 

90±28 

656±72 

1,963±304 

79±27 

4-Traps 

60 bit 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

61,800±22,225 

95,000±16,823 

18,600*±1,655 

71,884,800±19,416,482 

182*±58 

1.822±192 

10.658±2,321 

1.229±380 

292,000±55.857 

163,600±28.675 

18,600*±1,655 

823*±216 

1,842±268 

10,658±2,321 

5-Traps 

25 bit 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

11,650±5,350 

34,600±8,511 

9,550±2,061 

180,000±63,182 

44±13 

230±40 

1.099±299 

1.0*±0.6 

11,650±5,350 

48,800±14,288 

13,000±2,049 

564,000±265,083 

44±13 

231±35 

1,572±298 

4.0*±2.0 

5-Traps 

50 bit 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

57,750±18,250 

89,000±18,615 

34,600*±2,615 

200*±44 

841±180 

15.905±2,218 

57,750±18,250 

119,000±10,440 

43,800±3,341 

200*±44 

812±126 

20,196±2,552 

5-Traps 

75 bit 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

96,500±32,049 

218,000±32,496 

82,000±7,642 

519*±111 

2.235±219 

88.599±11,852 

247,500±45,373 

218,000±32,496 

132,400*±9,952 

1,297*±233 

2,235±219 

145,026±18,865 

NK n ^ 30, 
k — A, i — 1 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

10,725±3,976 

47,300±18,036 

9,500±1,975 

430,400±227,594 

52±12 

654±135 

1.185±265 

6.0*±3.6 

31,900±6,971 

55,000±7,362 

32,500±5,723 

79*±16 

787±133 

4,161±929 

NK n — 30, 
fc ^ 4, i ^ 2 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

49,100±13,423 

124,800±22,400 

45,500±12,114 

742,400±292,734 

109±24 

1.133±155 

6.105±1,741 

10*±4.3 

328,000±77,974 

238,400±40a28 

65,600*±11,056 

396*±108 

1,490±237 

12,854±2,877 

NK n ^ 34, 
k — A, i — 1 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

51,300±16,844 

48,800±14,918 

11,900*±1,678 

155,500±51,170 

126±31 

773±222 

1.858±413 

3.0*±1.2 

96,800±24,351 

63,000±8,473 

20,550*±3,556 

5,088,000±1,084,512 

171±46 

935±132 

3,296±738 

84*±22 

NK n ^ 34, 
fc ^ 4, f ^ 2 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

26,950*±8,176 

72,800±15,677 

42,700±8,032 

3,062,400±1,126,159 

99±23 

1.168±243 

7.908±1,908 

54*±20 

175,600±45,218 

127,600±13,260 

70,600*±9,902 

279*±81 

1,441±195 

16.500±3,364 

NK n — 30, 
fc ^ 5, f ^ 1 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

29,000±8,349 

7,525*±1,847 

11,825±1,434 

89,200±40,355 

98±15 

56±20 

1.429±283 

2.0*±1.2 

232,000±56,114 

44.500±7,124 

20,500*±3,892 

4,006,400±1,417,336 

323±82 

634±136 

2,742±634 

68*±22 

NK n — 30, 
k — 3, i — 2 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

99,600±12,706 

46,700±11,014 

22,150*±3,468 

688,800±263,708 

158±25 

770±153 

2.925±595 

11*±4.1 

182,400±40,917 

69,200±9,042 

40,200*±9,421 

240*±62 

990±144 

5,492±1,621 

NK n ^ 34, 
k — 5, i — 1 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

189,200±63,964 

79,800*±12,867 

319,200±44,535 

340*±122 

1.235±153 

97.558±18,726 

309,600±88,692 

149,200*±17,577 

548,800±84,830 

455*±136 

1,549±207 

178,667±38,349 

NK n ^ 34, 
k ^ 5, i^2 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

213,200±53,861 

138,400*±19,936 

280,000±43.377 

347*±101 

1.582±255 

85.876±15,708 

432.000±92,330 

288,000*±31,190 

436.800±86,029 

616*±143 

2,240±330 

144,185±33,991 

HIFF64 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

22,250±4,097 

43,200±4,578 

11,175*±729 

94*±15 

1.363±122 

6.562±1,068 

36,900±4,999 

75,800±7.318 

11,175*±729 

141*±14 

1.698±174 

6,562±1,068 

HIFF128 

DAE-EDA 

RBM-EDA 

BOA 

PBIL 

I03,400±ll,209 

285,200±19,964 

39,500*±3,122 

870*±103 

9.543±1,056 

101.067±16,547 

103,400±11,209 

966,400±85,026 

69,300*±3,913 

870*±103 

25,462±1,997 

163,750±22,829 


Table 1. This table shows average values for fitness evaluations and CPU time 
for DAE-EDA, RBM-EDA, BOA, and PBIL for the test problems. For each instance 
and algorithm, we selected the minimal population size which lead to the optimal 
solution in at least 50% of the runs (left two result columns) and at least 90% of the 
runs (right two columns). Results are averaged over 20 runs. Results marked with 
(*) are significantly smaller than other results in the respective table cell, according 
to pairwise Wilcoxon signed-rank tests (p < 0.01, data is not normally distributed) 
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BOA’s speedup is significantly higher than the one of DAE-EDA and RBM-EDA. 
However, almost every recent implementation of neural networks is parallelized on 
graphics processing units (GPU), which, in turn, speeds up training and sampling 
these models considerably (see e.g. [m [m |34]). Parallelizing multivariate EDAs 
such as BOA is well possible, however the speedups are often single- or double¬ 
digit, even on GPUs (see e.g. [Ml ES])- In contrast, parallelizing EDAs using 
neural networks can saturate modern GPU hardware and yield very high speedups: 
m report speedups of up to 200 x, against optimized CPU code, for RBM-EDA, 
which uses a neural network model that is closely related to the DAE. Hence, it 
is reasonable to assume that an efficient GPU-based implementation of DAE-EDA 
will be very fast, compared to other EDAs on problems with low, but non-negligible 
fitness evaluation costs. 

Regarding the model quality, we now exemplarily take a closer look at the results 
of two selected test instances, where DAE-EDA has a higher number of fitness 
evaluations than both BOA and RBM-EDA: the 75 bit concatenated 5-trap problem 
(Figure |2|) and the NK landscape with n = 34,fc = 4,i = 1 (Figure |3l). In both 
Figures, the left-hand side shows the number of fitness evaluations, the right-hand 
side shows the CPU time. Each data point of a problem marks the average results 
for a specific population size. Lines connect data points with adjacent population 
sizes. 

For the 75 bit 5-trap problem, we see that DAE-EDA approaches the optimum 
faster (i.e., with smaller population size, fewer fitness evaluations and less total 
time) than both RBM-EDA and BOA. For the NK landscape, DAE-EDA is com¬ 
parable to BOA. However, for both problems, the transition region from partial 
success to complete success is larger for DAE-EDA. In other words, DAE-EDA 
quickly finds optimal or close-to-optimal solutions with few evaluations in some 
runs, but often needs much larger populations for reliable convergence in every 
run. This pattern is qualitatively similar for other problems: On average, DAE- 
EDA needs 2.5 x the number of fitness evaluations to make >90% rather than >50% 
of runs converge, compared to 1.9 x for RBM-EDA and 1.6 x for BOA (see table 
|T|). This suggests that, in the current configuration, DAE-EDA is more dependent 
on the initialization of the DAE’s parameters 6. If, by chance, they are initialized 
in a particularly unfavorable way, the model is not able to learn an optimal hidden 
representation. The amount of random noise injected by corruption function q(-) 
is not sufficient to compensate for this effect. This hinders DAE-EDA from explor¬ 
ing the solution space more efficiently. This poses an interesting area for further 
research. 


5. Conclusion 

We introduced DAE-EDA, an Estimation of Distribution Algorithm which uses 
a Denoising Autoencoder as probabilistic model for solving combinatorial optimiza¬ 
tion problems. DAE-EDA uses a DAE to approximate the probability distribution 
of the fittest individuals and subsequently samples new candidate solutions from 
this distribution. We tested DAE-EDA using several instances of the standard 
benchmark problems concatenated deceptive traps, NK landscapes and the HIFF 
function. We compared the results to multiple other EDAs: The state-of-the-art 
Bayesian Optimization Algorithm, the multivariate RBM-EDA, another EDA based 
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Fitness Evaluations 



Figure 2. Number of fitness evaluations (left-hand side) and CPU 
time (right-hand side) for the 75 bit concatenated 5-Traps problem. 
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Figure 3. Number of fitness evaluations (left-hand side) and CPU 
time (right-hand side) for an NK landscape with n = 34,fc = 4 
(instance 1). 


on stochastic neural networks, which has shown to be computationally less expen¬ 
sive than the state of the art BOA for complicated problems, and the univariate 
PBIL. 

The results show that DAE-EDA is a very fast EDA that is able to decompose 
complicated problems. It needs a similar number of fitness evaluations like RBM- 
EDA, but does not reach the model quality of BOA. However, it is much faster 
than both RBM-EDA and BOA, as training and sampling the probabilistic model 
are conceptually simpler and computationally cheaper. Furthermore, a DAE is 
structurally similar to an RBM. Hence, we can assume the speedup of running a 
parallelized version of DAE-EDA on modern graphics processing units to be very 
high. 

In sum, DAE-EDA can be a useful tool for solving complex combinatorial opti¬ 
mization problems, where fitness evaluation costs are low, but non-negligible. 

There are multiple directions for further research. The results suggest that it 
could be benehcial to look into the parameter initialization more thoroughly, as 
DAE-EDA’s performance seems to be more susceptible to unfavorable initial pa¬ 
rameters. Another promising direction is to use a deep, multi-layered DAE to solve 
hierarchical problems. Also, other techniques for sampling a DAE exist, which may 
result in a different performance of DAE-EDA. 
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